
[Draft] Non-kerchunk backend for HDF5/netcdf4 files. #87

Draft · wants to merge 67 commits into main
Conversation

@sharkinsspatial (Collaborator) commented Apr 22, 2024

This is a rudimentary initial implementation for #78. The core code is ported directly from kerchunk's hdf backend. I have not ported the bulk of the kerchunk backend's specialized encoding translation logic but I'll try to do so incrementally so that we can build complete test coverage for the many edge cases it currently covers.

@sharkinsspatial marked this pull request as draft April 22, 2024 18:37
@TomNicholas (Collaborator) left a comment

This is looking great so far @sharkinsspatial!

kerchunk backend's specialized encoding translation logic

This part I would really like to either factor out, or at least really understand what it's doing. See #68

@@ -0,0 +1,206 @@
from typing import List, Mapping, Optional

import fsspec
Collaborator:

Does one need fsspec if reading a local file? Is there any other way to read from S3 without fsspec at all?

Collaborator:

Not with a filesystem-like API. You would have to use boto3 or aiobotocore directly.

This is one of the great virtues of fsspec and is not to be under-valued.
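
A minimal sketch of what that buys us (the S3 URL is hypothetical, and reading from S3 assumes s3fs is installed): the same file-like API covers local paths and object storage, so the HDF5 reader only has to deal with one interface.

import fsspec

# Same open/read/seek interface for local and remote storage; the bucket path is made up.
with fsspec.open("s3://example-bucket/data.nc", mode="rb", anon=True) as f:
    superblock = f.read(8)  # first bytes of the HDF5 superblock

with fsspec.open("local-data.nc", mode="rb") as f:  # a local file goes through the same API
    superblock = f.read(8)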

Comment on lines 188 to 191
def virtual_vars_from_hdf(
path: str,
drop_variables: Optional[List[str]] = None,
) -> Mapping[str, xr.Variable]:
Collaborator:

I like this as a way to interface with the code in open_virtual_dataset.
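
Roughly how I picture it being consumed (a hypothetical sketch; the path and variable name are made up, and virtual_vars_from_hdf is the function defined above):

import xarray as xr

# Build virtual variables from the HDF5 file, then assemble them into a dataset,
# which is roughly what open_virtual_dataset would do internally.
virtual_vars = virtual_vars_from_hdf(
    path="s3://example-bucket/data.nc",
    drop_variables=["history"],
)
vds = xr.Dataset(virtual_vars)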

@rabernat (Collaborator)

This looks cool @sharkinsspatial!

My opinion is that it doesn't make sense to just forklift the kerchunk code into virtualizarr. What I would love to see is an extremely tight, strictly typed, unit-tested total refactor of the parsing logic. I think you're headed down the right path, but I encourage you to push as far as you can in that direction.

@TomNicholas added the enhancement (New feature or request) and references generation (Reading byte ranges from archival files) labels Apr 22, 2024
@sharkinsspatial (Collaborator, Author)

@rabernat Fully agree with your take above 👆 👍 . I'm trying to work through this incrementally whenever I can find some spare time. In the spirit of thorough test coverage 🎊, looking through your issue pydata/xarray#7388 and the corresponding PR, I'm not sure what the proper incantation of variable encoding configuration is to use blosc with the netcdf4 engine. Do you have an example you can provide?
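
For reference, my current (possibly wrong) attempt looks like this; it assumes netCDF4-python was built with the blosc filter plugins available and uses the new compression keywords discussed in pydata/xarray#7388:

import numpy as np
import xarray as xr

# Possibly-wrong guess at blosc compression through the netcdf4 engine.
ds = xr.Dataset({"air": (("x", "y"), np.random.rand(100, 100).astype("float32"))})
ds["air"].encoding = {
    "compression": "blosc_lz",  # one of the blosc_* compressors
    "blosc_shuffle": 1,         # byte shuffle
    "complevel": 5,
    "chunksizes": (50, 50),
}
ds.to_netcdf("blosc_test.nc", engine="netcdf4")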

@TomNicholas mentioned this pull request May 14, 2024
if mapping["scale_factor"] != 1 or mapping["add_offset"] != 0:
float_dtype = _choose_float_dtype(dtype=dataset.dtype, mapping=mapping)
target_dtype = np.dtype(float_dtype)
codec = FixedScaleOffset(
Collaborator:

Are you able to make this test parametrization pass with this PR? It's currently xfailed because open_virtual_dataset doesn't know how to handle scale factor encoding.
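
For concreteness, the encoded variable in that case is roughly of this shape (a sketch with hypothetical names, not the actual test fixture): a variable written with CF-style scale/offset packing through the netcdf4 engine.

import numpy as np
import xarray as xr

# CF-style packing: values are stored as int16 and unpacked via scale_factor/add_offset.
ds = xr.Dataset({"temperature": ("x", np.linspace(250.0, 300.0, 10))})
ds["temperature"].encoding = {"dtype": "int16", "scale_factor": 0.01, "add_offset": 273.15}
ds.to_netcdf("scaled.nc", engine="netcdf4")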

Collaborator (Author):

I might be misunderstanding, but none of the hdf reader code will be called for loadable_variables and this block would only be entered for a loaded variable. Is that correct?


shape = tuple(math.ceil(a / b) for a, b in zip(dataset.shape, dataset.chunks))
paths = np.empty(shape, dtype=np.dtypes.StringDType) # type: ignore
offsets = np.empty(shape, dtype=np.int32)
Contributor:

After #177, these arrays will need to be uint64 instead of int32.
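
i.e. something along these lines (a sketch of the suggested change, with a placeholder shape):

import numpy as np

# Byte offsets (and lengths, elsewhere in this function) can exceed the int32 range
# for large files, so the manifest arrays should use unsigned 64-bit integers.
shape = (4, 4)  # placeholder for the chunk-grid shape computed above
offsets = np.empty(shape, dtype=np.uint64)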

manifest = _dataset_chunk_manifest(path, dataset)
if manifest:
chunks = dataset.chunks if dataset.chunks else dataset.shape
codecs = codecs_from_dataset(dataset)
Contributor:

Leaving compressor=None causes ambiguity for roundtripping v3 metadata (ZArray -> disk -> ZArray) because we can't determine if it's a list of 2 filters or a list of one filter and one compressor. zlib is a compression codec and FixedScaleOffset is not, but should they both be treated as filters?
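
For example, with compressor left as None, these two layouts carry the same information and can't be told apart after a roundtrip (a sketch using numcodecs objects with made-up parameters):

from numcodecs import FixedScaleOffset, Zlib

# Everything in the filter chain, mirroring HDF5's single filter pipeline:
all_filters = {
    "filters": [FixedScaleOffset(offset=273.15, scale=100, dtype="f8", astype="i2"), Zlib(level=4)],
    "compressor": None,
}

# The conventional Zarr v2 split, with the final codec promoted to compressor:
split = {
    "filters": [FixedScaleOffset(offset=273.15, scale=100, dtype="f8", astype="i2")],
    "compressor": Zlib(level=4),
}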

Collaborator (Author):

@ghidalgo3 My rationale for describing the full codec chain in the filters property is that internally HDF5 does not distinguish compressors from filters; the entire encoding chain is represented as filters. Since we don't need to worry about v2 interoperability, I think we can focus on aligning with v3's API (which still seems to be in a state of flux). I prefer the approach proposed in zarr-developers/zarr-python#1944 (comment), but I don't know where that leaves me in the interim until a final decision gets made on the v3 API path 🤔. For v3 compatibility we'll also need to track zarr-developers/numcodecs#524 so that we use numcodecs codecs compatible with the new v3 codec specification. TL;DR: I think we might be in flux for some time while upstream v3 decisions get made.

@sharkinsspatial (Collaborator, Author) commented Jul 24, 2024

@ghidalgo3 I also want to address your question from PR #193:

Also, what happens if a source file uses a codec that is not one of the specified codecs of ZarrV3? Does that mean the file cannot be represented in ZarrV3? Seems rather onerous.

IIUC, different v3 implementations will support a codec registry (zarr-developers/zarr-python#1588) to make codec support fully extensible. Codec discovery and registration has always been a thorny problem (this is a big issue in the HDF space), but I'm hopeful that this approach will be flexible.
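
The registry pattern is roughly what numcodecs already provides today; here's a sketch with a made-up pass-through codec, just to illustrate the hook a third-party filter would use:

from numcodecs.abc import Codec
from numcodecs.compat import ndarray_copy
from numcodecs.registry import register_codec

# A made-up no-op codec; real HDF5 filters would implement actual encode/decode logic.
class IdentityCodec(Codec):
    codec_id = "identity-example"  # hypothetical id

    def encode(self, buf):
        return buf

    def decode(self, buf, out=None):
        return ndarray_copy(buf, out)

# Once registered, metadata referencing {"id": "identity-example"} can be resolved.
register_codec(IdentityCodec)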

@sharkinsspatial (Collaborator, Author)

@TomAugspurger I'm trying to merge main into my branch so I can also investigate removing the Pydantic dependency in this branch and have my pre-commit working correctly again. But I seem to be hitting an issue with some change in zarr.py from #213. All of the roundtrip tests are failing on to_kerchunk_json with TypeError: array([xxx]) is not JSON serializable but I can't seem to find which of your changes is causing this 🤔 and I'm a bit stumped. I'm packing and moving countries this week 🇲🇽 -> 🇨🇦 so I'm probably missing something obvious 😄 . Any thoughts?

@TomAugspurger (Contributor)

@sharkinsspatial this is a behavior change in ZArray.dict during the move away from pydantic.

Something like this seems to fix the failing tests:

diff --git a/virtualizarr/zarr.py b/virtualizarr/zarr.py
index 824892c..87bb453 100644
--- a/virtualizarr/zarr.py
+++ b/virtualizarr/zarr.py
@@ -106,8 +106,15 @@ class ZArray:
 
     def to_kerchunk_json(self) -> str:
         zarray_dict = self.dict()
-        if zarray_dict["fill_value"] is np.nan:
+
+        fill_value = zarray_dict["fill_value"]
+
+        if fill_value is np.nan:
             zarray_dict["fill_value"] = None
+
+        elif isinstance(fill_value, (np.number, np.ndarray)):
+            zarray_dict["fill_value"] = fill_value.item()
+
         return ujson.dumps(zarray_dict)
 
     # ZArray.dict seems to shadow "dict", so we need the type ignore in
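
The underlying failure is just ujson refusing numpy values; a small reproduction sketch (assuming the fill_value comes back as a one-element numpy array, matching the reported error):

import numpy as np
import ujson

fill_value = np.array([9.969209968386869e36])  # hypothetical value, same shape as in the error
try:
    ujson.dumps({"fill_value": fill_value})
except TypeError as err:
    print(err)  # "array([...]) is not JSON serializable"

# Works once cast to a plain Python scalar, which is what the diff above does.
print(ujson.dumps({"fill_value": fill_value.item()}))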

I'm not sure what behavior we want dict() to have: whether it should faithfully return the value as set, or attempt to cast it to something else.

@TomNicholas (Collaborator) commented Aug 10, 2024

The ZArray class was supposed to be a way to standardize the metadata, allowing the rest of the package to not worry about any differences that kerchunk throws at us.

The .dict() method was something we got for free by inheriting from the pydantic base model.

I think we should migrate the interface of ZArray towards providing a unified representation of the metadata, and also try to move its API closer to that of zarr-python's ZMetaData class because really we want to be using that instead.

(Not sure if that actually answers your question)

Labels: encoding · enhancement (New feature or request) · references generation (Reading byte ranges from archival files)
5 participants